40 research outputs found

Blaeu: Mapping and navigating large tables with cluster analysis

Author: Cijvat C.P. (Robin)
Kersten M.L. (Martin)
Koopmanschap R.A. (Richard)
Sellam T.H.J. (Thibault)
Publication venue: VLDB Endowment
Publication date: 05/09/2016

Blaeu is an interactive database exploration tool. Its aim is to guide casual users through large data tables, ultimately triggering insights and serendipity. To do so, it relies on a double cluster analysis mechanism. It clusters the data vertically: it detects themes, groups of mutually dependent columns that highlight one aspect of the data. Then it clusters the data horizontally. For each theme, it produces a data map, an interactive visualization of the clusters in the table. The data maps summarize the data. They provide a visual synopsis of the clusters, as well as facilities to inspect their content and annotate them. But they also let the users navigate further. Our explorers can change the active set of columns or drill down into the clusters to refine their selection. Our prototype is fully operational, ready to deliver insights from complex databases.

CWI's Institutional Repository

Genome sequence analysis with MonetDB: a case study on Ebola virus diversity

Author: Cijvat C.P. (Robin)
Kersten M.L. (Martin)
Klau G.W. (Gunnar)
Manegold S. (Stefan)
Marschall T. (Tobias)
Schönhuth A. (Alexander)
Zhang Y. (Ying)
Publication date: 03/03/2015

Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but results in terabytes of data to be stored and analysed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.

CWI's Institutional Repository

Genome sequence analysis with MonetDB - A case study on Ebola virus diversity

Author: Cijvat C.P. (Robin)
Kersten M.L. (Martin)
Klau G.W. (Gunnar)
Manegold S. (Stefan)
Marschall T. (Tobias)
Schönhuth A. (Alexander)
Zhang Y. (Ying)
Publication venue: Springer Verlag
Publication date: 01/11/2015

Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but yields terabytes of data to be stored and analyzed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.

CWI's Institutional Repository

Genome sequence analysis with MonetDB: a case study on Ebola virus diversity

Author: Alexander Schönhuth
Gunnar W Klau
Martin Kersten
Robin Cijvat
Stefan Manegold
Tobias Marschall
Ying Zhang
Publication date: 30/04/2020

Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but results in terabytes of data to be stored and analysed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.

CiteSeerX

Computational pan-genomics: status, promises and challenges

Author: Abeel Thomas
Alkan Can
Baaijens Jasmijn
Bakker Paul
Boeva Valentina
Bonnal Raoul
Chiaromonte Francesca
Chikhi Rayan
Ciccarelli Francesca
Cijvat Robin
Datema Erwin
Dijkstra Louis
Duijn Cornelia
Dutilh Bas
Eichler Evan
El-Kebir Mohammed
Ernst Corinna
Eskin Eleazar
Garrison Erik
Ghaffaari Ali
Guryev Victor
Kersey Paul
Klau Gunnar
Kloosterman Wigard
Korbel Jan
Lameijer Eric-Wubbo
Langmead Benjamin
Marschall Tobias
Martin Marcel
Marz Manja
Medvedev Paul
Mu John
Mäkinen Veli
Neerincx Pieter
Novak Adam
Ouwens Klaasjan
Paten Benedict
Peterlongo Pierre
Pisanti Nadia
Porubsky David
Rahmann Sven
Raphael Benjamin
Reinert Knut
Ridder Dick
Ridder Jeroen
Rivals Eric
Sanders Ashley
Schlesner Matthias
Schulz-Trieglaff Ole
Schönhuth Alexander
Sheikhizadeh Siavash
Shneider Carl
Smit Sandra
The Computational Pan-Genomics Consortium
Valenzuela Daniel
Vandin Fabio
Wang Jiayin
Wessels Lodewyk
Ye Kai
Zhang Ying
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

EUR Research Repository

HAL-MINES ParisTech

Archivio della ricerca della Scuola Superiore Sant'Anna

Radboud Repository

HAL-Rennes 1

Computational pan-genomics: Status, promises and challenges

Author: Abeel T. (Thomas)
Alkan C. (Can)
Baaijens J.A. (Jasmijn)
Bakker P.I.W. (Paul) de
Boeva V. (Valentina)
Bonnal R.J.P. (Raoul)
Chiaromonte F. (Francesca)
Chikhi R. (Rayan)
Ciccarelli F.D. (Francesca)
Cijvat C.P. (Robin)
Datema E. (Erwin)
Dijkstra L.J. (Louis)
Duijn C.M. (Cornelia) van
Dutilh B.E. (Bas)
Eichler E.E. (Evan)
El-Kebir M. (Mohammed)
Ernst C. (Corinna)
Eskin E. (Eleazar)
Garrison E. (Erik)
Ghaffaari A. (Ali)
Guryev V. (Victor)
Kersey P. (Paul)
Klau G.W. (Gunnar)
Kloosterman W.P. (Wigard)
Korbel J.O. (Jan)
Lameijer E.-W. (Eric-Wubbo)
Langmead B. (Benjamin)
Marschall T. (Tobias)
Martin M. (Marcel)
Marz M. (Manja)
Medvedev P. (Paul)
Mu J.C. (John)
Mäkinen V. (Veli)
Neerincx P.B.T. (Pieter)
Novak A.M. (Adam)
Ouwens K. (Klaasjan)
Paten B. (Benedict)
Peterlongo P. (Pierre)
Pisanti N. (Nadia)
Porubsky D. (David)
Rahmann S. (Sven)
Raphael B.J. (Benjamin)
Reinert K. (Knut)
Ridder D. (Dick) de
Ridder J. (Jeroen) de
Rivals E. (Eric)
Sanders A.D. (Ashley)
Schlesner M. (Matthias)
Schulz-Trieglaff O. (Ole)
Schönhuth A. (Alexander)
Sheikhizadeh S. (Siavash)
Shneider C. (Carl)
Smit S. (Sandra)
The Computational Pan-Genomics Consortium
Valenzuela D. (Daniel)
Vandin F. (Fabio)
Wang J. (Jiayin)
Wessels L.F.A. (Lodewyk)
Ye K. (Kai)
Zhang Y. (Ying)
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations

CWI's Institutional Repository

Erasmus University Digital Repository

Bridging the gap between Big Genome Data Analysis and Database Management Systems

Author: Cijvat Robin
Publication venue: Utrecht University - Department of Information and Computing Sciences
Publication date: 01/02/2014

The bioinformatics field has encountered a data deluge in recent years, due to the increasing speed and decreasing cost of DNA sequencing technology. Today, sequencing the DNA of a single genome only takes about a week, and it can result in up to a terabyte of data. The sequencing data are usually stored in files, and specialized tools have been designed to analyze and manage them. Despite these tools, bioinformaticians are still exposed to many data management hurdles when analyzing these files, which often leads to excessively time-consuming tasks. In this thesis, we accurately map the needs of bioinformaticians by defining a set of use cases that reflect the everyday analysis that is applied to genetic data. We propose a modern DBMS-based approach to analyze and manage genetic data file repositories. We identify the pros and cons of this method compared to the traditional file-based approach. Additionally, we experimented with a novel in-situ approach, where the DBMS applies Just-In-Time ETL (Extract-Transform-Load) on the original files instead of loading all data from these files up front. A major advantage of this approach is that it greatly reduces the data-to-query time, since not all data are loaded in the DBMS during initialization. Other advantages include the decrease in storage requirements and the reduced data duplication. With this project, we have taken the first step towards the adaptation of state-of-the-art database technology to accelerate genetic data analytics. The preliminary results presented in this thesis are highly promising and they open up a plethora of new research opportunities.

CWI's Institutional Repository